Abstract

The purpose of this study is to help ensure that strategies for differential item functioning (DIF) 
detection for students with disabilities are appropriate and lead to meaningful results. We 
surveyed existing DIF studies for students with disabilities and describe them in terms of study 
design, statistical approach, sample characteristics, and DIF results. Based on descriptive and 
graphical summaries of previous DIF studies, we make recommendations for future studies of 
DIF for students with disabilities. 

Overview



Differential item functioning (DIF) refers to group differences in performance on a test 
item that cannot be explained by group differences in the construct targeted by the item (Crocker 
& Algina, 1986; Clauser & Mazor, 1998). Test items are identified as exhibiting DIF when, after 
matching examinee groups by a measure of ability, the performance of one group is significantly 
higher than the other group, on average. When DIF is found to occur, it means that a test item is 
measuring traits or abilities that are secondary to the targeted ability. For students with 
disabilities, such secondary traits could be a test taker’s ability to access the math content in a 
word problem or the ability to respond to a computer-delivered constructed response item with a 
keyboard, for example. For such students, opportunity to learn the content may also be 
considered a secondary trait. 

Secondary traits measured by items showing DIF may be relevant or irrelevant to the 
targeted ability. When test items measure secondary traits or abilities that are irrelevant to the 
intended measure for some groups, such items are considered biased. Item bias is one aspect of 
fairness in testing and test use (American Educational Research Association, American 
Psychological Association, & National Council on Measurement in Education, 1999). To ensure 
test fairness, DIF statistical methodology is used to empirically identify items that are performing 
differently across focal and reference groups after matching examinees based on ability, and 
human judgment is used to decide whether an item showing DIF is biased based on its 
characteristics (Zieky, 1993; Zumbo, 1999). When an item shows moderate to high levels of 
DIF, the item is typically reviewed by content experts. In the test development stage, an item 
showing DIF may either remain as is, be revised, or be deleted from the item pool. In an 
operational setting, an item showing DIF may be removed from the calculated test score 
depending on the results of the item review. 
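
To make the matching step concrete, the following is a minimal sketch of one widely used DIF statistic, the Mantel-Haenszel (MH) common odds ratio, expressed on the ETS delta scale as MH D-DIF = -2.35 ln(alpha). The function name, the "ref"/"foc" labels, and the input layout are assumptions made for illustration; this is not the procedure of any particular study surveyed here.

```python
import math
from collections import defaultdict

def mantel_haenszel_dif(item, group, matching):
    """Return (alpha_MH, MH D-DIF) for one dichotomous item.

    item:     0/1 responses to the studied item
    group:    "ref" or "foc" for each examinee (hypothetical labels)
    matching: matching-criterion score for each examinee (e.g., total score)
    """
    # One 2x2 table per matching-score level:
    # [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for x, g, s in zip(item, group, matching):
        idx = (0 if g == "ref" else 2) + (0 if x == 1 else 1)
        tables[s][idx] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n  # reference right * focal wrong
        den += b * c / n  # reference wrong * focal right
    if num == 0 or den == 0:
        raise ValueError("MH odds ratio is undefined for these data")

    alpha = num / den
    # Negative D-DIF values indicate an item that favors the reference group.
    return alpha, -2.35 * math.log(alpha)
```

Under these assumptions, an alpha near 1 (D-DIF near 0) indicates that matched examinees in the two groups have similar odds of answering correctly. Whether a flagged item is biased remains a matter for content review, as described above.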

Over the last 5 years, there has been a substantial increase in the number of studies using 
DIF methods to compare students with and without disabilities. Since students with disabilities 
are not a homogeneous subpopulation (Johnstone, Thompson, Bottsford-Miller, & Thurlow, 
2008), comparison groups must often be disaggregated based on specific disability subtypes. 
While small sample sizes had limited the number of DIF studies for students with disabilities 
historically, recent changes have provided opportunities to conduct item-level analyses and to 
make judgments about fairness for more specific disability subgroups. Such changes include 
increased participation of students with disabilities in large-scale statewide assessments 
(Thurlow, Quenemoen, Altman, & Cuthbert, 2008) and postsecondary education (U.S. 
Department of Commerce, Bureau of the Census, 2004, 2006), federal requirements for ensuring 
sound technical quality of alternate assessments taken by some students with disabilities (United 
States Government Accountability Office, 2009), and novel approaches to item analysis for low 
incidence disability subtypes (described below). 

Given that multiple statistical approaches are available to study DIF and that multiple 
decisions are made once DIF items are detected (Clauser & Mazor, 1998), we surveyed existing 
DIF studies for students with disabilities to inform recommendations for future studies on DIF 
for students with disabilities. The aim of the current synthesis is to help ensure that DIF studies 
for students with disabilities are appropriate and lead to meaningful results. The following 
sections provide an overview of standard practice for studying DIF, a description of the 
characteristics of students with disabilities that may impact DIF analyses, a summary of previous 
research, and recommendations for studying DIF for students with disabilities. 

Standard Practice for Studying Differential Item Functioning (DIF) 

Examining DIF is not simply algorithmic; rather, judgments need to be made at various 
steps in the process. Such decisions include (a) identifying the comparison groups, (b) choosing a 
matching criterion, (c) choosing a statistical approach, and (d) interpreting DIF results, including 
what to do with items showing DIF (Clauser & Mazor, 1998). However, some standard practices 
in DIF analyses may lead to more valid inferences. These include ensuring that scores on the 
matching criterion are reliable and valid (Clauser & Mazor, 1998), using sufficient sample sizes 
in the reference and focal groups (Zieky, 1993), obtaining a matching criterion from a 
standardized administration across comparison groups (Dorans & Holland, 1993), and examining 
focal and reference groups with similar ability distributions, particularly when methods of 
detecting DIF that are not based on item response theory (IRT) are used (e.g., Klockars & Lee, 
2008; Mazor, Clauser, & Hambleton, 1992; Narayanan & Swaminathan, 1994, 1996). 

Reliability and validity of matching criterion. In DIF analyses, the matching criterion 
should be a valid measure of the target ability measured by the items. Several decisions need to 
be made regarding the choice of matching criterion that may impact validity (Clauser & Mazor, 
1998). When ability differences exist between the reference and focal groups, an unreliable 
criterion will lead to the most discriminating items being flagged for DIF (Clauser & Mazor, 
1998). When an internal matching criterion is used, such as the total test score, the matching 
criterion should also be minimally impacted by DIF items (Clauser & Mazor, 1998). Purification 
of the matching criterion can be accomplished with statistical procedures (e.g., using an iterative 
Mantel-Haenszel [MH] procedure [Holland & Thayer, 1988], or by selecting the option using the 
SIBTEST software [Shealy & Stout, 1993]). Administering test items to the reference and focal 
groups under standardized testing conditions can also help ensure that the matching criterion is 
essentially free from DIF. 
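
Purification of an internal matching criterion can be sketched as follows. This is one possible reading of an iterative MH procedure, not a reproduction of Holland and Thayer (1988) or of SIBTEST: items flagged in one round are dropped from the matching score in the next round, while the studied item is always kept. The function names, the threshold of 1.5 on the MH D-DIF scale, the round limit, and the handling of undefined odds ratios are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def mh_d_dif(item, is_focal, matching):
    """MH D-DIF for one 0/1 item, stratified by the matching score."""
    # Per stratum: [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for x, f, s in zip(item, is_focal, matching):
        tables[s][(2 if f else 0) + (0 if x else 1)] += 1
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables.values())
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables.values())
    if num == 0 or den == 0:
        return 0.0  # sketch-only choice: undefined odds ratio is not flagged
    return -2.35 * math.log(num / den)

def purified_flags(responses, is_focal, threshold=1.5, max_rounds=10):
    """Return the set of item indices flagged after iterative purification.

    responses: one list of 0/1 item scores per examinee
    is_focal:  True for focal-group examinees, False for reference-group
    """
    n_items = len(responses[0])
    flagged = set()
    for _ in range(max_rounds):
        new_flags = set()
        for j in range(n_items):
            # Match on unflagged items, always keeping the studied item.
            keep = [k for k in range(n_items) if k == j or k not in flagged]
            matching = [sum(row[k] for k in keep) for row in responses]
            item = [row[j] for row in responses]
            if abs(mh_d_dif(item, is_focal, matching)) >= threshold:
                new_flags.add(j)
        if new_flags == flagged:  # flagged set is stable, so stop
            break
        flagged = new_flags
    return flagged
```

In practice a significance test would usually accompany the magnitude threshold; it is omitted here to keep the sketch short.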

Standardized administration. Studies for DIF typically compare students under the 
same testing conditions (Zieky, 1993). This comparison helps to ensure that examinees from the 
reference and focal groups are appropriately matched when using an internal matching criterion, 
which is essential for obtaining valid DIF results (Clauser & Mazor, 1998). When the matching 
variable is based on incomparable measures, such as when the measurement conditions have 
been altered for one group or for some members of one or both groups, DIF techniques are not 
appropriate unless the measures are shown to be comparable. 

Sample size. Sufficient sample sizes in both focal and reference groups are necessary in 
order to have enough power to detect differences in performance across groups matched on 
ability. Based on research by Narayanan and Swaminathan (1994) and Rogers and Swaminathan 
(1993), sample sizes of 200 to 250 per group will likely have enough power to detect DIF using 
non-IRT methods including MH (Holland & Thayer, 1988), logistic regression (Swaminathan & 
Rogers, 1990), and SIBTEST (Shealy & Stout, 1993). At one time, the practice at ETS when 
using MH was for the smaller group to comprise at least 100 examinees, with the total number of 
examinees equal to 500 or more for test development; for postadministration and prescore 
reporting, examinees in the smaller group must total at least 200 with a minimum of 600 
combined examinees; and at least 500 people must be included in the smaller group postscore 
report when analyses are conducted on a group that has never been studied (Zieky, 1993). 
However, sample size requirements have varied over time (e.g., recent guidelines included a 
300/700 rule) and across testing programs and purposes. IRT-based methods for detecting 
DIF generally require larger sample sizes in order to estimate model parameters for both the 
reference and focal groups (Clauser & Mazor, 1998). Alternative approaches, described below, 
have been used to study DIF for smaller groups, but the success of these methods has yet to be 
studied. 
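
The historical ETS minimums for MH described above (Zieky, 1993) can be written as a simple check. This is a sketch only: the dictionary keys and function name are hypothetical labels, and, as noted above, actual requirements have varied over time and across testing programs.

```python
# Historical MH sample-size minimums as described in the text (Zieky, 1993).
# The purpose labels are hypothetical names chosen for this sketch.
RULES = {
    "test_development": {"min_smaller": 100, "min_total": 500},
    "prescore_reporting": {"min_smaller": 200, "min_total": 600},
    # Only a smaller-group minimum is stated for a never-studied group.
    "postscore_new_group": {"min_smaller": 500, "min_total": 0},
}

def meets_minimum(n_ref, n_foc, purpose):
    """True if the group sizes satisfy the minimums for the given purpose."""
    rule = RULES[purpose]
    smaller_ok = min(n_ref, n_foc) >= rule["min_smaller"]
    total_ok = (n_ref + n_foc) >= rule["min_total"]
    return smaller_ok and total_ok
```

For example, a reference group of 5,949 and a focal group of 485 would satisfy the prescore reporting minimums, whereas groups of 92 and 74 would not satisfy even the test development minimums.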

Differences in ability distributions. When the ability distributions of the reference and 
focal groups differ, the efficacy of the matching criterion and the results from DIF analyses can 
be impacted. Previous methodological work has evaluated the impact of differences in ability 
distributions on DIF results. Studies have found that large mean differences in ability 
distributions across groups are associated with decreased power to detect DIF using non-IRT 
methods (e.g., Klockars & Lee, 2008; Mazor, Clauser, & Hambleton, 1992; Narayanan & 
Swaminathan, 1994, 1996). In addition, the MH statistic and its modifications have been shown 
to have higher Type I error rates as ability distributions become more discrepant and 
discrimination parameters differ across groups (Fidalgo & Madeira, 2008). 

DIF and Students With Disabilities 

The feasibility of studying DIF for students with disabilities has improved as more 
students with disabilities are being assessed (e.g., on state criterion-referenced assessments, 
Center on Education Policy, 2009). However, complexities still exist. For example, some DIF 
studies for students with disabilities have had reference and focal groups that were tested under 
different conditions and that had different proficiency distributions, both of which can influence 
the proportion and type of items that are flagged for DIF (Dorans & Holland, 1993; Fidalgo & 
Madeira, 2008; Sireci, 2009). 

Some students with disabilities take assessments with accommodations. Testing 
accommodations are intended to remove barriers to accessing test items due to students’ 
disabilities. That is, “the psychometric function of accommodations is to increase the validity of 
inferences about students with [disabilities] by offsetting specific disability-related, construct- 
irrelevant impediments to performance” (Koretz & Hamilton, 2006, p. 562). For example, a 
braille format test allows students who are blind to access items that would be unreadable in 
paper or computer format. In testing, the term accommodations typically refers to changes in the 
test that are not intended to alter the construct being measured. The term modifications refers to 
changes in the test that do alter the construct being measured. The designation of a test condition 
alteration as an accommodation or a modification may be based on policy or research and may 
change based on the purposes of the test. For example, state departments of education differ in 
their designation of the read-aloud accommodation on a reading assessment as an 
accommodation or a modification (Laitusis, 2008). 

DIF analyses are routinely performed on groups with different measurement conditions 
because of accommodation use. However, using reference and focal groups that differ in whether 
they received testing accommodations may impact the validity of the matching criterion. When 
accommodation use itself alters most or all of the items on a test such that they become a 
measure of the students’ ability to use the accommodation as well as the targeted ability, then 
accommodation use will introduce error in the DIF results because an appropriate internal 
matching criterion will not be available. 

Small sample sizes and nonoverlapping proficiency distributions are two common 
characteristics of data from students with disabilities that may influence the results from 
statistical analyses including DIF (Sireci, 2009). Historically, students with specific types of 
disabilities either have been excluded from DIF studies or have been included in analyses by 
aggregating students with different types of disabilities under an umbrella classification. This 
latter option is sometimes chosen because of small sample sizes in groups related to specific 
disability subtypes since small sample sizes can lead to low power to detect performance 
differences between groups. In addition, students with disabilities, particularly those with 
cognitive disabilities, tend to have lower performance levels than students without disabilities 
(e.g., Klein, Wiley, & Thurlow, 2006; Ysseldyke et al., 1998) due, in part, to disability 
classification criteria that include low achievement levels or abilities. As such, a focal group 
comprising students with disabilities may have a test score distribution that is positively skewed 
with a mean far below the reference group, which could lead to too many items, particularly easy 
items, being flagged for DIF (Sireci, 2009). Low test reliability for students with disabilities is 
also a concern, particularly for tests with a broader proficiency range, such as state accountability 
assessments (Koretz & Hamilton, 2006). 

Several other characteristics of students with disabilities and assessments taken by some 
students with disabilities can impact DIF results. Such characteristics include (a) performance 
assessments with too few items or insufficient statistical properties; (b) complications from 
accommodation use such as discrepancies between assignment and use, database errors, or no 
information on accommodations available to the researcher; (c) low performance, which may 
lead to low precision of estimates if the test is linear and is targeted to an ability level far above 
the subgroup ability level; and (d) factors that may differentially impact the underlying ability 
distribution of subgroups (e.g., students with physical vs. cognitive disabilities, opportunity to 
learn impacted by disability). In addition, classification of disability status is often inconsistent 
(Koretz & Hamilton, 2006) and there are numerous stages where errors in coding student 
characteristics in a database can occur, consequently contributing to inaccurate DIF results. 

Design Frameworks for Differential Item Functioning (DIF) Studies 

The following describes three different design options that have been used to study DIF. 
The subscript s denotes scores obtained from a standard administration. The subscript a denotes 
scores obtained under an accommodated administration. 

Design 1 

Design 1 (see Table 1) is a standard DIF design. This design can be used to determine 
whether DIF exists for students with disabilities relative to students without disabilities (i.e., 
Group 1 is students without disabilities and Group 2 is students with disabilities). No students 
receive accommodations under this design. 1 Studies we surveyed that included comparisons with 
Design 1 were Bolt and Ysseldyke (2006), Bennett, Rock, and Kaplan (1985), Cline, Stone, and 
Cook (2008), Engelhard (2009), Kato, Moen, and Thurlow (2009), and Steinberg, Cline, Ling, 
Cook, and Tognatta (2008). 

Table 1
Design 1

            Item (X_s)    Matching variable (Y_s)
Group 1         ✓                   ✓
Group 2         ✓                   ✓

Note. Groups 1 and 2 receive item X_s and have the matching variable Y_s under standard 
conditions. 

Design 2 

Design 2 (see Table 2) is routinely used with standard DIF methods to study the impact 
of accommodation use (e.g., Bolt & Ysseldyke, 2006; Laitusis, Cook, & Aicher, 2004; Cohen, 
Gregg, & Deng, 2005; Finch, Barton, & Meyer, 2009; Ling & Stone, 2008; Stone, Cook, 
Laitusis, & Cline, 2010). In some situations, this design is the only feasible option (e.g., studying 
DIF for blind students tested with items delivered in Braille relative to sighted students). Design 
2 is also used when DIF is performed on existing datasets in which students in one or both 
groups use accommodations but either the accommodations differ across groups (e.g., Bennett, 
Rock, & Kaplan, 1985; Laitusis, Maneckshana, & Monfils, 2009) or it is unknown whether 
students in either group received accommodations (e.g., Abedi, Leon, & Kao, 2008). 


Table 2
Design 2

            Item (X_s)    Matching variable (Y_s)    Item (X_a)    Matching variable (Y_a)
Group 1         ✓                   ✓
Group 2                                                  ✓                   ✓

Note. Group 1 receives item X_s and matching variable Y_s under standard conditions; Group 2 
receives item X_a and matching variable Y_a under accommodated conditions. 

In most situations, Design 2 violates the assumptions of standard DIF analysis because 
there is no common matching variable (i.e., Y_s ≠ Y_a) and evidence of DIF would likely be a 
function of differences between Y_s and Y_a. Were standard DIF procedures to be used with such a 
design, evidence that Y_s and Y_a measure the same thing should be provided. 2 Design 3 studies 
rely on the assumptions that accommodations are appropriately administered to those who need 
them and that they do not alter the construct being measured. Such assumptions should also be 
verified to provide evidence that the matching criterion, and consequently the DIF results, is not 
impacted by accommodation use. 

Design 3 

Design 3 (see Table 3) is ideal for examining the effects of an accommodation (or bundle 
of accommodations) that can be administered to both groups (e.g., a read-aloud accommodation). 
An example of this data collection design can be found in Laitusis (2010). Among the studies 
surveyed, Engelhard (2009) and Ling and Stone (2008) used this design. In this design, DIF 
procedures could be used separately for both the standard administration and the accommodated 
administration, and results compared to see if the change in measurement conditions associated 
with accommodation use alters the results of the DIF analysis. 

Table 3
Design 3

            Item (X_s)    Matching variable (Y_s)    Item (X_a)    Matching variable (Y_a)
Group 1         ✓                   ✓                    ✓                   ✓
Group 2         ✓                   ✓                    ✓                   ✓

Note. Group 1 receives both item X_s and matching variable Y_s under standard conditions and 
item X_a and matching variable Y_a under accommodated conditions; Group 2 receives both item 
X_s and matching variable Y_s under standard conditions and item X_a and matching variable Y_a 
under accommodated conditions. 



Method 

Existing DIF studies on students with disabilities were surveyed from among research on 
ETS testing programs and external research published in peer-reviewed journals. We collected 
information from the studies including choice of reference and focal groups, how 
accommodation use is treated, and the statistical method(s) used to conduct item-level analysis 
and address test fairness. In addition, we recorded item-level information including the number 
of DIF items, the magnitude of DIF, and the difficulty of the DIF items, when available. The 
number of items, the sample sizes for reference and focal groups, and the observed score 
distribution (i.e., mean and variance) were also recorded. We chose to focus on comparisons that 
resulted in evidence supporting DIF in order to obtain information on factors associated with 
finding DIF items for students with disabilities. 

We summarized the studies by looking for trends in proportion of items being flagged for 
DIF based on choice of reference and focal groups, treatment of accommodation use, and type of 
disability (i.e., cognitive vs. physical). 4 The studies are summarized below both descriptively and 
graphically. The graphical summary focuses on the relationship between the percentage of items 
flagged for DIF and item difficulty. 

Summary of Differential Item Functioning (DIF) Studies for Students with Disabilities 

We collected 17 unique studies on DIF for students with disabilities published between 
1986 and 2010; among those, 9 were conducted by ETS researchers. The appendix contains 
information about the number of DIF comparisons, type of assessment, and studied disability 
groups. The 17 studies comprised 123 separate DIF comparisons that resulted in finding items 
that exhibited DIF. Of the 123 comparisons, 72% used students without disabilities as the 
reference group; in the remaining comparisons, both reference and focal groups comprised 
students with disabilities. Slightly less than half of the comparisons involved studying the impact 
of accommodations, 28% did not involve accommodation use, and for the remaining 24% of the 
comparisons, the authors did not know whether or not students received accommodations. The 
ability of the focal and reference groups, based on disability type and accommodation status, was 
similar for 54% of the comparisons and different for 23%. For the remaining comparisons, 
information on the types of disabilities the students had was unavailable so it was unclear 
whether cognitive ability was similar or different across reference and focal groups. 

Based on the observed score distribution, the mean score of the focal group was lower 
than the reference group for 66% of the comparisons, the means were similar for 18% of the 
comparisons, and the focal group had a higher mean than the reference group in 6% of the 
comparisons. All comparisons in which the observed score distributions were similar involved 
students with disabilities in both the reference and the focal groups. Seventy-four percent of the 
comparisons that used students without disabilities in the reference group had lower observed 
score means for the focal group, and among those comparisons, 48% involved 
studying the impact of accommodations. Of the comparisons in which students with disabilities 
were in both focal and reference groups, 32% had a lower observed score mean for the focal 
group. Among these comparisons, all but one involved studying the impact of accommodations. 

The average sample size for the reference group across comparisons was 53,620 with a 
minimum of 92 and the median equal to 5,949. 5 The focal group average was 5,419 with the 
median equal to 485 and a minimum of 74. For comparisons in which students with disabilities 
were in both reference and focal groups, the average sample size of the reference group was 
2,495 and the average sample size for the focal groups was 665. Across all comparisons, the 
mean number of items was 53 with a minimum of 8 and a maximum of 75. Thirty-eight percent 
of the comparisons were carried out with the MH statistic. Other methods used were SIBTEST 
(10%), logistic regression (13%), and IRTLRDIF (Thissen, 2001; 13%). We found several novel 
approaches to studying DIF for students with disabilities. Johnstone, Thompson, Moen, Bolt, and 
Kato (2005) used a combination of item analyses including item ranks, item total correlation, and 
DIF with contingency tables and IRT-based methods. Engelhard (2009) framed DIF analyses in 
terms of model-data fit and residual analyses. Sixty-two percent of the comparisons considered 
only uniform DIF, whereas the remaining studied both uniform and nonuniform DIF. 

Trends in DIF flagging for the surveyed studies are shown in the following graphs. The 
graphs in Figures 1 to 4 display percentages of low-, medium-, and high-difficulty items flagged 
for DIF. In each graph, each comparison is represented by up to three points depending on 
whether it had a percentage greater than zero of items flagged for DIF in the respective difficulty 
strata. Note that most studies included more than one DIF comparison. Because the tests we 
collected are quite different (e.g., different grade levels, different content, and different item 
difficulty statistics used by the authors), we categorized items by relative difficulty within each 
test. This categorization was done using proportion correct in the reference group for a majority 
of the comparisons and a proxy for proportion correct (e.g., based on the IRT difficulty 
parameter) for the remaining comparisons. Items were sorted by relative difficulty and then 
divided into three categories based on where their difficulty statistic fell compared to other items 
in the test. 
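
The within-test categorization described above can be sketched as follows. The equal-thirds split is an assumption made for illustration, since the exact cut points are not specified here; the function name and input layout are likewise hypothetical.

```python
def difficulty_strata(p_values):
    """Map each item to 'low', 'medium', or 'high' relative difficulty.

    p_values: dict of item id -> proportion correct in the reference group
              (or a proxy for it, such as one derived from an IRT difficulty).
    """
    # A higher proportion correct means an easier item, so sort descending.
    ordered = sorted(p_values, key=p_values.get, reverse=True)
    n = len(ordered)
    strata = {}
    for rank, item in enumerate(ordered):
        if rank < n / 3:
            strata[item] = "low"      # easiest third of the test
        elif rank < 2 * n / 3:
            strata[item] = "medium"
        else:
            strata[item] = "high"     # hardest third of the test
    return strata
```

Because the strata are defined relative to the other items on the same test, an item labeled low difficulty on one test is not necessarily easier in absolute terms than an item labeled high difficulty on another.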

Figures 1 and 2 include DIF comparisons categorized by whether the reference group and 
focal group were tested under the same conditions. Groups that were tested under the same 
conditions were either both administered the assessment under standard conditions or both 
administered the assessment with the same accommodation(s). For a majority of the 
comparisons, testing conditions were clearly defined by the authors; however, there were a few 
comparisons for which testing conditions were unknown, or accommodated and 
nonaccommodated groups were combined. We excluded cases in which the accommodations 
were unknown or when accommodated and nonaccommodated groups were combined. 

As shown in Figures 1 and 2, comparisons involving different administration conditions 
generally had higher percentages of DIF items than comparisons involving the same 
administration conditions. In addition, there appears to be a slight trend of higher percentages of 
easy items flagged for DIF when groups were tested under different conditions. 

As discussed above, when tests are administered under different conditions due to 
offering accommodations to students with disabilities who need them, the influence on the 
quality of DIF results is unclear. Figures 3 and 4 display percentages of items flagged for DIF for 
comparisons that are distinguished by disability status and accommodation use. Figure 3 displays 
comparisons involving a nondisabled group and a group of students with disabilities who received 
accommodations. Figure 4 shows comparisons involving a nonaccommodated 
group and an accommodated group, with both groups composed of students with disabilities. 

Figure 1. Percentage of low-, medium-, and high-difficulty items flagged for differential 
item functioning (DIF) when groups were tested under different conditions. 

Figure 2. Percentage of low-, medium-, and high-difficulty items flagged for differential 
item functioning (DIF) when groups were tested under the same conditions. 

Both Figures 3 and 4 show that there were higher percentages of easy and medium 
difficulty items flagged for DIF. The majority of comparisons in Figure 4 had relatively low 
percentages of items flagged for DIF. Relative to Figure 1, in Figure 3 there are fewer 
comparisons with higher percentages of items flagged for DIF and fewer comparisons with a high 
percentage of easy items flagged for DIF. This result is expected since the comparisons in 
Figure 3 involve groups that should be most similar in terms of ability: nondisabled students 
and students with disabilities receiving accommodations. The comparisons in Figure 4 involve 
students with disabilities in both groups in an attempt to improve matching of students in the 
reference and focal groups by comparing groups with similar disabilities. The percentages of 
items flagged for DIF for these studies, which evaluate the impact of accommodations, are 
similar to those in Figure 1, which shows all comparisons under different testing conditions. 

Figure 3. Percentage of low-, medium-, and high-difficulty items flagged for differential 
item functioning (DIF) in comparisons involving a nonaccommodated, nondisabled group 
and accommodated students with disabilities. 

Figure 4. Percentage of low-, medium-, and high-difficulty items flagged for differential 
item functioning (DIF) in comparisons involving nonaccommodated students with 
disabilities and accommodated students with disabilities. 

To further explore DIF results when substantial ability (or estimated ability) differences 
are present between groups, we plotted the percentage of items flagged for DIF versus a measure 
of the effect size for the score distribution difference between groups for a subset of the surveyed 
studies for which summary statistics were available. We used the standardized mean difference 
(SMD) to calculate Cohen’s D, where SMD is computed as 

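\[
\mathrm{SMD} = \frac{\bar{X}_{\mathrm{Ref}} - \bar{X}_{\mathrm{Foc}}}{\sqrt{\dfrac{(n_{\mathrm{Ref}} - 1)\,\mathrm{Var}_{\mathrm{Ref}} + (n_{\mathrm{Foc}} - 1)\,\mathrm{Var}_{\mathrm{Foc}}}{n_{\mathrm{Ref}} + n_{\mathrm{Foc}}}}}
\]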


Using the typical effect size cut-off values of 0.2 (negligible), 0.5 (moderate), and 0.8 
(important), Figure 5 shows that the majority of comparisons have at least a moderate difference 
in scores, with many comparisons involving an important or large score difference. Among 
comparisons in which the reference group has higher ability than the focal group (i.e., 
comparisons with positive Cohen’s D values), there is a slight positive relationship (r = .19) 
between the size of the group ability difference and the percentage of DIF items flagged in the 
studies for which we were able to obtain summary statistics. 
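
As a hedged illustration, the SMD can be computed from published summary statistics as sketched below. The function and argument names are hypothetical. The denominator divides the pooled sum of squares by the sum of the two group sizes, following the formula as given in this report; some texts divide by the sum of the group sizes minus 2 instead.

```python
import math

def standardized_mean_difference(mean_ref, var_ref, n_ref,
                                 mean_foc, var_foc, n_foc):
    """SMD with the reference-group mean first, so positive values mean
    the reference group scored higher than the focal group."""
    pooled_var = ((n_ref - 1) * var_ref + (n_foc - 1) * var_foc) / (n_ref + n_foc)
    return (mean_ref - mean_foc) / math.sqrt(pooled_var)

# Hypothetical summary statistics, not taken from any surveyed study.
d = standardized_mean_difference(30.0, 25.0, 101, 25.0, 25.0, 101)
```

With these made-up inputs the group means differ by about one pooled standard deviation, which would fall above the 0.8 cut-off.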

Figure 5. Total percentage of items flagged for differential item functioning (DIF) plotted 
against the standardized mean difference between the empirical score distributions of the 
groups. 

Another concern discussed above involves the reliability of the matching criterion, which 
is often the total test score. Although it is not the only factor, the number of items strongly 
influences the reliability of the test (i.e., generally, adding more items to a test increases the 
reliability). Figure 6 shows the relationship between the number of items on the test and the 
percentages of items flagged for DIF for the reviewed studies. The correlation between the 
number of items and the percentage that were flagged for DIF was -0.34 in this group of 
comparisons. Because many of the comparisons involved tests with the same number of items, 
the numbers of items are jittered to allow for a clearer display in the graph. There is a general 
trend of a higher percentage of items flagged for DIF when fewer items are included on the test. 
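
The link between test length and reliability noted above can be illustrated with the Spearman-Brown formula, which predicts the reliability of a test lengthened by a given factor. This is a sketch that assumes the added items are parallel to the existing ones; the starting reliability of 0.60 is a made-up value, not one reported in the surveyed studies.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by length_factor,
    assuming the added (or removed) items are parallel to the originals."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical 8-item test with reliability 0.60, lengthened to 75 items
# (8 and 75 are the minimum and maximum test lengths among the comparisons).
predicted = spearman_brown(0.60, 75 / 8)
```

Doubling a test with reliability 0.60 would give a predicted reliability of 0.75 under this formula, which is one reason a short total score can be a weak matching criterion.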




Figure 6. Total percentage of items flagged for differential item functioning (DIF) plotted 
against the number of items on the test. 

Recommendations for Studying Differential Item Functioning (DIF) for Students With 
Disabilities 

The information we obtained from the DIF studies surveyed provided some insight into 
our hypotheses. We found that, in general, comparisons involving focal and reference groups that 
were tested under the same conditions resulted in lower percentages of items being flagged for 
DIF relative to comparisons involving groups that were tested under different conditions. The 
summary also suggested that comparing students without disabilities to students with disabilities 
who receive accommodations as well as comparing groups with similar disabilities that differ on 
accommodation use may result in meaningful comparisons. Prior research suggests that 
accommodations differ in their effectiveness; as such, some accommodations may be more 
appropriate to study with DIF methods than others. We recommend that DIF studies continue to 
compare students without disabilities to students with disabilities who receive accommodations 
that they need and that have been shown to be effective. However, we urge caution 
when using DIF as a tool to study the impact of accommodations in a study that is not 
experimentally designed. If DIF analyses are to be used to study the impact of accommodations, we 
recommend that an external matching criterion be considered and that decisions based on the 
statistical results be supplemented by expert opinion or existing research on the efficacy of the 
specific accommodation, the impact of accommodation use on the appropriateness of the 
matching criterion, and the amount of construct-irrelevant variance expected to be introduced 
from the interaction between the item characteristics and the accommodation. 
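
As a hedged illustration of matching on a caller-supplied criterion, the following Python sketch computes the Mantel-Haenszel common odds ratio and the corresponding delta-scale statistic (MH D-DIF = -2.35 ln α) for one studied item. The Mantel-Haenszel procedure appears here only as one widely used DIF method, not as the method of any particular surveyed study; the function name, the record layout, and the toy data are assumptions made for this sketch. Because the matching score is passed in by the caller, it can be an external criterion rather than the total score on the test that contains the studied item.

```python
import math
from collections import defaultdict


def mantel_haenszel_ddif(records):
    """Return (alpha_MH, MH D-DIF) for one studied item.

    records: iterable of (group, matching_score, correct), where group is
    "reference" or "focal", matching_score is the stratifying variable
    (possibly an external criterion), and correct is a bool.
    """
    # Per stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        if group not in ("reference", "focal"):
            raise ValueError(f"unknown group label: {group!r}")
        offset = 0 if group == "reference" else 2
        tables[score][offset + (0 if correct else 1)] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if numerator == 0 or denominator == 0:
        raise ValueError("odds ratio undefined for these data")

    alpha = numerator / denominator
    # Negative MH D-DIF values indicate an item that favors the reference group.
    return alpha, -2.35 * math.log(alpha)


# Hypothetical records for a single matching stratum (NOT real data).
toy = (
    [("reference", 1, True)] * 3 + [("reference", 1, False)] * 1
    + [("focal", 1, True)] * 1 + [("focal", 1, False)] * 3
)
alpha_mh, mh_ddif = mantel_haenszel_ddif(toy)
print(alpha_mh, mh_ddif)
```

A statistic of this kind is only a screening device; as recommended above, flagged items still call for expert judgment about the accommodation and the appropriateness of the matching criterion.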

Although we were unable to calculate the Type I error rate for the comparisons in this study 
because the true DIF status of the items was unknown, many of the comparisons involved groups 
that differed in observed score means or in cognitive ability based on disability subtype and 
accommodation use. We found a slight trend of higher percentages of items being flagged for 
DIF as the ability distributions between the groups became more discrepant. Future DIF studies 
on students with disabilities should take into account the methodological literature and employ 
methods that are most robust to discrepant ability distributions (e.g., Fidalgo & Madeira, 2008; 
Klockars & Lee, 2008; Mazor, Clauser, & Hambleton, 1992; Narayanan & Swaminathan, 1994, 
1996) when the groups of interest are not expected to perform similarly on the total test due to 
their disability. In addition, we caution against combining students with different disability 
subtypes, particularly subtypes that differ in their impact on cognitive ability, and instead support 
the creation of separable, well-defined focal and reference groups that are defined by 
theoretically important research questions rather than sampling convenience. 
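
One way to gauge how discrepant two groups' score distributions are before choosing a DIF method is a standardized mean difference. The sketch below assumes the difference in means divided by the pooled standard deviation; the exact definition used for the graphical summaries in this report may differ, and the function name and scores are hypothetical.

```python
import statistics


def standardized_mean_difference(reference_scores, focal_scores):
    """(reference mean - focal mean) divided by the pooled standard deviation."""
    n_ref, n_foc = len(reference_scores), len(focal_scores)
    var_ref = statistics.variance(reference_scores)  # sample variance (n - 1)
    var_foc = statistics.variance(focal_scores)
    pooled_sd = (
        ((n_ref - 1) * var_ref + (n_foc - 1) * var_foc) / (n_ref + n_foc - 2)
    ) ** 0.5
    mean_diff = statistics.mean(reference_scores) - statistics.mean(focal_scores)
    return mean_diff / pooled_sd


# Hypothetical total test scores (NOT data from the surveyed studies).
smd = standardized_mean_difference([10, 12, 14], [8, 10, 12])
print(smd)
```

Larger absolute values signal the kind of ability discrepancy for which the methodological literature cited above recommends more robust procedures.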

Finally, when conducting DIF studies for students with disabilities on assessments with 
few items, we suggest ensuring that the test is sufficiently reliable and exhibits robust 
psychometric properties. Items found to exhibit DIF on such assessments should be carefully 
evaluated by content experts to ensure that results are due to legitimate causes of DIF rather than 
spurious statistical findings. If the technical quality of the assessment is in question due to the 
small number of items or test format, then we recommend evaluating fairness with methods other 
than DIF analyses and content review, such as cognitive interviews, to understand whether or not 
items are functioning as intended. 

Through our survey of DIF studies for students with disabilities, we found that studies 
varied widely in their design and analysis approach and in how the assumptions of the analysis 
procedures were treated. Consequently, DIF studies for this subpopulation are difficult to 
summarize, making it challenging to synthesize results in order to generalize inferences. The 
lack of uniformity in DIF studies for this subpopulation, and in general, highlights the 
importance of content experts in identifying the practical importance of DIF results and in 
determining whether items and tests are biased. Without interpretation by content experts, DIF 
results are open to question because the underlying statistical analyses rest on decisions, made 
at many phases of the research study, that are not backed by methodological research. 

The lack of uniformity also highlights the need for guidelines for conducting DIF studies 
in general and for specific subgroups. As noted in the section on standard DIF procedures, there 
are many aspects of DIF for which some rules of thumb or standard practices have been reported: 
reliability and validity of matching criterion, sample size, administration condition, and ability 
distribution. In many studies one or two of these aspects are of particular concern. However, all 
of these aspects are potentially called into question when evaluating DIF in comparisons 
involving students with disabilities. Until these guidelines are created, it is of the utmost 
importance that publications make clear the characteristics of their samples and the analysis so 
that research can be used to accumulate knowledge rather than exist in isolation. In some of the 
studies that we evaluated, this was not the case, and it was difficult to place the study results in 
context. For example, when groups with different accommodations are combined, or the testing 
conditions are unknown, the inability to separate out differential effects may be problematic. 
Similarly, grouping all students with physical disabilities into one focal group ignores the 
heterogeneous nature of the population of students with disabilities. In making inferences based 
on that focal group, for example, one would have to determine whether it makes sense to assume 
that students with visual, hearing, and motor skills difficulties experience similar obstacles 
during testing. This concern can be even greater when the heterogeneous focal group is loosely 
based on cognitive disabilities, which can manifest in very different response profiles. 
Consideration must also be given to the methods used to undertake DIF analysis and whether the 
selected method is robust to violations of assumptions. 

This study explored existing DIF studies on students with disabilities; while no statistical 
tests were performed, we summarized the existing studies in terms of technical characteristics 
and described trends in the percentage of items flagged for DIF. As such, this work provides a 
foundation for further evaluations of DIF studies for students with disabilities. Future research 
can build upon this work by directly evaluating the interaction between item characteristics, 
accommodation use, and students’ disabilities, with the aim of understanding the most 
appropriate ways to evaluate and ensure fairness in testing students with disabilities. 

Notes 

1 An equivalent standard design would have both groups receiving the same accommodation(s) 
under the same measurement conditions. None of the surveyed studies used such a design. 

2 Possible sources of evidence include multigroup factor analysis, review of item characteristics 
and their interactions with the accommodation(s), or results from experimentally designed 
research on the impact of specific accommodations. 

3 In many testing situations, it may be unclear from the data collected whether a test taker used an 
assigned accommodation for a particular studied item. In such cases, providing evidence on 
the comparability of the measurement conditions is even more complex. 

4 The proportion of items flagged for DIF may be a function of effect sizes used to flag items. 

5 The large average sample size was due to one study with a number of comparisons with over 
400,000 students in the reference group. 

6 This assessment is a rubric-scored, on-demand performance assessment. 

7 The DIF comparisons relating to an alternative assessment were excluded from the graphical 
displays because both the student groups and the test administration differ greatly from the 
other studies. 